Syllables and other String Kernel Extensions
Abstract
Recently, the use of string kernels that compare documents as a string of letters has been shown to achieve good results on text classification problems. In this paper we introduce the application of the string kernel in conjunction with syllables. Using syllables shortens the representation of documents and as a result reduces computation time. Moreover syllables provide a more natural representation of text; rather than the traditional coarse representation given by the bag-of-words, or the too fine one resulting from considering individual letters only. We give some experimental results which show that syllables can be effectively used in text-categorisation problems. In this paper we also propose two extensions to the string kernel. The first introduces a new lambda-weighting scheme, where different symbols can be given differing decay weightings. This may be useful in text and other applications where the insertion of certain symbols may be known to be less significant. We also introduce the concept of 'soft matching', where symbols can match (possibly weighted by relevance) even if they are not identical. Again, this provides a method of incorporating prior knowledge where certain symbols can be regarded as a partial or exact match and contribute to the overall similarity measure for two data items.
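As a rough illustration of the ideas in the abstract, the following sketch computes a gap-weighted subsequence kernel by brute-force enumeration over any sequence of symbols (letters, or syllables that have already been split). It is not the authors' implementation. The function name subsequence_kernel, the decay and sim parameters, and the rule that a subsequence's weight is the product of the decay values of every symbol it spans are all assumptions made for this example; with a constant decay and exact matching it reduces to the standard string subsequence kernel. Enumeration is exponential in n, so this is only suitable for short inputs.

```python
from itertools import combinations
from math import prod


def subsequence_kernel(s, t, n, decay=lambda sym: 0.5,
                       sim=lambda a, b: 1.0 if a == b else 0.0):
    """Brute-force gap-weighted subsequence kernel of order n.

    s, t  : sequences of symbols (a str of letters or a list of syllables).
    decay : hypothetical per-symbol decay weight (assumption: the weight of
            an occurrence is the product of decay over its whole span).
    sim   : hypothetical soft-matching score for a pair of symbols;
            1.0 for identical symbols and 0.0 otherwise by default.
    """
    if n < 1:
        raise ValueError("n must be at least 1")
    total = 0.0
    for i in combinations(range(len(s)), n):
        # Decay over every symbol between the first and last chosen index.
        w_s = prod(decay(s[p]) for p in range(i[0], i[-1] + 1))
        for j in combinations(range(len(t)), n):
            # Soft match: product of pairwise symbol similarities.
            match = prod(sim(s[a], t[b]) for a, b in zip(i, j))
            if match == 0.0:
                continue
            w_t = prod(decay(t[q]) for q in range(j[0], j[-1] + 1))
            total += match * w_s * w_t
    return total


# Letters, exact matching, uniform decay of 0.5.
print(subsequence_kernel("cat", "car", 2))  # 0.0625

# Pre-split syllables with a soft match between two similar syllables.
soft = lambda a, b: 1.0 if a == b else (0.5 if {a, b} == {"nel", "nels"} else 0.0)
print(subsequence_kernel(["ker", "nel"], ["ker", "nels"], 2, sim=soft))  # 0.03125
```

Passing a decay function that returns 1.0 for a given symbol makes gaps over that symbol free, which is one simple way to model insertions that are known to be insignificant.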
Similar resources
Syllables and other String Kernel Extensions
In recent years, the use of string kernels that compare documents has been shown to achieve good results on text classification problems. In this paper we introduce the application of the string kernel in conjunction with syllables. Using syllables shortens the representation of documents compared to a character-based representation and as a result reduces computation time. Moreover sylla...
The Spectrum Kernel: A String Kernel for SVM Protein Classification
We introduce a new sequence-similarity kernel, the spectrum kernel, for use with support vector machines (SVMs) in a discriminative approach to the protein classification problem. Our kernel is conceptually simple and efficient to compute and, in experiments on the SCOP database, performs well in comparison with state-of-the-art methods for homology detection. Moreover, our method produces an S...
Position-Aware String Kernels with Weighted Shifts and a General Framework to Apply String Kernels to Other Structured Data
In combination with efficient kernel-based learning machines such as the Support Vector Machine (SVM), string kernels have proven to be significantly effective in a wide range of research areas (e.g. bioinformatics, text analysis, voice analysis). Many of the string kernels proposed so far take advantage of simpler kernels such as trivial comparison of characters and/or substrings, and are classifie...
Learning state machine-based string edit kernels
Over the past few years, several works have derived string kernels from probability distributions. For instance, the Fisher kernel uses a generative model M (e.g. a hidden Markov model) and compares two strings according to how they are generated by M. On the other hand, the marginalized kernels allow the computation of the joint similarity between two instances by summing condit...
String Subsequence Kernels for Text Classification
This paper explores the string subsequence kernel, a kernel function whose feature space is generated by subsequences of strings. This kernel compares two strings based on the number of occurrences of common substrings they contain, where each common substring is weighted based on how contiguous that substring is within the string. Although a recursive definition of the string subsequence kerne...